Relatedness in the post-genomic era:
is it still useful?

Doug Speed and David J. Balding

Abstract | Relatedness is a fundamental concept in genetics but is surprisingly hard to 
define in a rigorous yet useful way. Traditional relatedness coefficients specify expected 
genome sharing between individuals in pedigrees, but actual genome sharing can differ 
considerably from these expected values, which in any case vary according to the 
pedigree that happens to be available. Nowadays, we can measure genome sharing 
directly from genome-wide single-nucleotide polymorphism (SNP) data; however, 
there are many such measures in current use, and we lack good criteria for choosing 
among them. Here, we review SNP-based measures of relatedness and criteria for 
comparing them. We discuss how useful pedigree-based concepts remain today and 
highlight opportunities for further advances in quantitative genetics, with a focus on 
heritability estimation and phenotype prediction. 



Relatedness 

Two individuals are related if 
they have a recent common 
ancestor, where 'recent' 
can be variously defined 
as outlined under IBD 
(identity-by-descent). 
Traditional measures of relatedness, which are based
on probabilities of IBD (identity-by-descent) from 
common ancestors within a pedigree, depend on the 
choice of pedigree. However, in natural populations 
there is no complete pedigree or any optimal pedigree 
that could form the basis of a canonical definition of 
relatedness. Moreover, the random nature of recom- 
bination during meiosis means that expected genome 
sharing specified by IBD probabilities is an imprecise 
guide to actual genome sharing: human half-siblings 
are expected to share half of each chromosome that 
they received from their common parent, but the 95% 
credible interval for their actual amount shared ranges 
from 37% to 63% (see below). More recent approaches 
use genetic markers either to estimate IBD probabili- 
ties in an unobserved pedigree, of which founders are 
assumed to be unrelated, or to identify shared genomic 
regions that are unaffected by recombination since 
their most recent common ancestor (MRCA). These 
approaches also suffer from difficulties (see below). 
Although the problems in defining and measuring 
relatedness have been appreciated by some authors,
no better approach has gained widespread accept- 
ance, and the implications of different approaches for 
applications are rarely noted. 

Genome-wide single-nucleotide polymorphism 
(SNP) data now allow us to measure realized genome 
sharing with great accuracy and without reference 
to pedigree-based concepts, but this advance brings 



new problems. First, two haploid human genomes are 
typically identical at >99.9% of sites owing to shared 
inheritance from common ancestors. However, 
sequence identity across SNPs is much lower. SNP- 
based measures of genome similarity will depend sen- 
sitively on the minor allele fractions (MAFs) of the 
SNP set, which reflect both choice of SNP genotyping 
technology and the quality control procedures used. 
Even when the SNP set is fixed, there remain many 
ways to measure the similarity of genomes, and we 
lack criteria to choose among them. The usual statisti- 
cal criteria of bias and precision of an estimator are less 
useful for natural populations because of the lack of an 
interpretable relatedness parameter to be the target of 
estimation (see below). 

In this Review, we argue that IBD-based concepts 
of relatedness are now of limited value in genetics. 
Many previous uses can now be replaced by models 
and analyses that are based directly on actual genome 
similarity, although much work remains to be done to 
define concepts and to evaluate measures of genome 
similarity in the post-pedigree era. We begin with a 
recap and critique of pedigree-based relatedness, but 
our main focus is on SNP-based measures. Therefore, 
we complement a previous review that focused on 
pedigree-based relatedness. We also discuss how
relatedness can be defined in terms of genome-wide 
distributions of time since the MRCA (TMRCA). This 
seems to provide the most promising route to a 
IBD

(Identity-by-descent; also 
identical-by-descent). The 
phenomenon whereby two 
individuals share a genomic 
region as a result of inheritance 
from a recent common 
ancestor, where 'recent' can 
mean from an ancestor in a 
given pedigree, or with no 
intervening mutation event 
or with no intervening 
recombination event. 

Pedigree 

A set of individuals connected 
by parent-child relationships. 

Most recent common 
ancestor 

(MRCA). Although the 
ancestries of two alleles may 
both pass through the same 
individual, they pass through 
different alleles with 
probability 0.5, in which 
case that individual is not 
the MRCA of the alleles. 



Box 1 | Defining and measuring pedigree-based relatedness

Relatedness is popularly understood in terms of the
shortest lineage path (or paths) linking two individuals; 
for example, cousins are linked by two lineage paths each 
of length four. However, pairs of individuals are linked by 
many lineage paths. Coefficients that incorporate all 
lineage paths within a specified pedigree have been 
developed since the 1940s. The kinship coefficient
(or coancestry) of individuals B and C is the probability
that two homologous alleles, one drawn from each of the two
individuals, are IBD (identical-by-descent). It can be
computed as follows.

\[
\theta(B,C) = \sum_{A} \frac{1 + f_A}{2^{g_A + 1}} \tag{13}
\]

In this equation, the sum is over every most recent common ancestor (A) of B and C in the pedigree ('most recent'
means that no descendant of A is also a common ancestor of B and C), g_A is the number of parent-child links in the
lineage path linking B and C via A, and f_A is the inbreeding coefficient of A, which equals the coancestry of its parents.

In the pedigree shown (see the figure), assuming unrelated founders, half-siblings B and C have one common
ancestor A with g_A = 2, so θ(B,C) = 1/2^3 = 1/8. Cousins B and D have two common ancestors E and F, with g_E = g_F = 4,
so θ(B,D) = 2 × (1/2^5) = 1/16. If in fact E and F were related with θ(E,F) = 1/20, then f_A = 1/20 and θ(B,C) = (1 + 1/20)/2^3 = 21/160.
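The worked values above can be checked with a short sketch of the lineage-path sum. The function name `kinship` and the (g_A, f_A) input format are illustrative choices, not part of the original text; exact fractions are used so that the results can be compared directly with those quoted in the box.

```python
from fractions import Fraction

def kinship(paths):
    """Sum (1 + f_A) / 2**(g_A + 1) over most recent common ancestors.

    `paths` is a list of (g_A, f_A) pairs: the number of parent-child links
    on the lineage path through ancestor A, and A's inbreeding coefficient.
    """
    return sum((1 + Fraction(f)) / 2 ** (g + 1) for g, f in paths)

# Half-siblings: one common ancestor, path length 2, outbred.
half_sibs = kinship([(2, 0)])
# First cousins: two common ancestors, each with path length 4.
cousins = kinship([(4, 0), (4, 0)])
# Half-siblings whose common parent has inbreeding coefficient 1/20.
half_sibs_inbred = kinship([(2, Fraction(1, 20))])

print(half_sibs, cousins, half_sibs_inbred)  # 1/8 1/16 21/160
```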

θ(B,C) can be interpreted as the expected IBD fraction for two alleles at a locus, one each from B and C. There are 15
possible IBD states for the four alleles of B and C, which reduce to 9 if we regard each individual's two alleles as
unordered, and reduce further to 3 (IBD = 0, 1 or 2) if f_B = f_C = 0 (REF. 88). θ can then be expressed as follows, where
φ = P[IBD = 1] and Δ = P[IBD = 2].

\[
\theta = \frac{E[\mathrm{IBD}]}{4} = \frac{\phi}{4} + \frac{\Delta}{2} \tag{14}
\]

If Δ > 0, then the relationship is bilineal and can help to assess the contribution of dominance to the genetic architecture
of traits. Full siblings have Δ = 0.25 and so they have matching genotypes owing to the shared parents at ~25% of the
genome. The variance in the IBD fraction can also be useful in distinguishing between relatives: in an outbred pedigree
(f_X = 0 for all X), half-siblings, uncle-niece and grandparent-grandchild all have φ = 1/2 and Δ = 0, but uncle-niece have two
common ancestors each with g_A = 3, which implies more but shorter regions shared IBD and therefore a lower variance
than for the other two relationships (which have one common ancestor with g_A = 2). Early work on the distribution of
lengths of IBD regions is referred to as the theory of junctions.
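As a small illustration of equation 14, the sketch below converts IBD-state probabilities into a coancestry. The function name is hypothetical, and the full-sibling value P[IBD = 1] = 1/2 is the standard outbred value, assumed here because the box states only Δ = 0.25 for full siblings.

```python
from fractions import Fraction

def coancestry_from_ibd(phi, delta):
    """theta = E[IBD] / 4, where E[IBD] = phi + 2 * delta (outbred individuals)."""
    return (Fraction(phi) + 2 * Fraction(delta)) / 4

# Full siblings: P[IBD = 1] = 1/2 (assumed standard value), P[IBD = 2] = 1/4.
full_sibs = coancestry_from_ibd(Fraction(1, 2), Fraction(1, 4))
# Half-siblings, uncle-niece, grandparent-grandchild: phi = 1/2, delta = 0.
unilineal = coancestry_from_ibd(Fraction(1, 2), 0)

print(full_sibs, unilineal)  # 1/4 1/8
```

Both values agree with the first two rows of Table 1; the three unilineal relationships share the same θ and differ only in the variance of the realized IBD fraction.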



Time since the MRCA 

(TMRCA; in generations). 
If the times back to a common 
ancestor differ between two 
individuals, then the average 
is used. 

Heritability 

The proportion of phenotypic 
variation that can be attributed 
to any genetic variation 
(broad-sense heritability) or 
to additive genetic variation 
(narrow-sense heritability [h²]).

Lineage paths 

Sequences of parent-child 
steps linking individuals with 
length equal to the number 
of steps. 

Coancestry 

(θ). A kinship coefficient
defined as the probability that 
two homologous alleles, one 
drawn from each of two 
individuals, are IBD 
(identical-by-descent).

Inbreeding coefficients 

The coancestries of the two 
parents of an individual. 



satisfactory conceptual definition of relatedness, but 
its practical usefulness has not yet been well explored. 
Finally, we review the use of relatedness in heritability 
estimation and phenotype prediction to capture the 
polygenic contribution to a complex trait, and discuss 
some implications of moving from pedigree-based to 
SNP-based measures of relatedness. 

Throughout this Review, we refer to two sets of 
simulations that both use the Decode Genetic Map,
which specifies male and female recombination rates 
over 2,667 Mb across the 22 autosomes. For simula- 
tions of Type A, we generate sequence data for pairs 
of individuals with one or two recent common ances- 
tors, and examine the fraction of DNA that the two 
individuals share IBD from the ancestor (or ances- 
tors). For Type B simulations, we generate sequence 
data from a Wright-Fisher population of 5,000 males
and 5,000 females simulated over 50 generations from 
unrelated founders. The mating pattern is modified 
so that the probabilities for a female to have 0, 1, 2 or 
>2 children are 0.22, 0.20, 0.26 and 0.31, respectively, 
which is similar to Australian census data; if two individuals
have the same mother, then the probability that
they have the same father is ~0.62 (see Supplementary
information S1 (box)).



Pedigree-based relatedness 

The classical theory of kinship coefficients based on 
lineage paths in pedigrees (BOX 1 ) provides a mathemati- 
cally beautiful structure that has historically been use- 
ful, but its weaknesses are apparent. Pedigree founders 
are typically assumed to be unrelated, but this is only 
realistic in certain settings, such as some designed breed- 
ing programmes or an isolated population created by a 
specific founding event. All pairs of individuals with no 
common ancestor in the pedigree have coancestry (θ)
of zero, but in practice they can have important differ- 
ences in genome similarity. To overcome these problems, 
it may seem desirable to seek ever-larger pedigrees but, 
if we continue to add additional ancestors to an existing 
pedigree, then the coancestries of the original pedigree
members will continue to increase and will eventually 
converge to one, which would be useless in practice. 
The lack of a complete or an ideal pedigree means that 
the choice of pedigree and hence any resulting kinship 
values are arbitrary to some extent. Similarly, pedigree- 
based inbreeding coefficients are of limited value: they also 
converge to one as the pedigree information increases 
and therefore make sense only with respect to a trunca- 
tion of the pedigree, for example, at G generations before 
present. Even then, interpretability remains a problem
because generations are typically not well defined (multiple
lineage paths to a common ancestor can have different
lengths) and G is arbitrary, yet shared DNA is treated
very differently if the sharing originates in, for example,
generation (G − 1) rather than generation (G + 1).

TABLE 1 shows the variability of θ′ (the realized IBD
fraction from the specified common ancestors in a Type
A simulation) about its expected value θ. Indeed, differences
in θ′ can be exploited to estimate narrow-sense
heritability (h²) using pairs of individuals with the same
θ, such as siblings or even unrelated individuals (θ = 0)
(REF. 12). This contrasts with traditional estimates that
require pairs of individuals with different θ, such as
monozygotic and dizygotic twin pairs. The table also
reports P[θ′ > 0], the probability of any IBD from the
specified common ancestors. For example, two children
with a common great-grandmother (G = 3; A = 1) will
each have substantial genome sharing IBD with her but
could, in effect, not be related to each other. Although
they are expected to share ~100 Mb IBD over ~7 regions,
there is a probability of 0.005 that they share no DNA
from her, despite the pedigree relationship. The values
for P[θ′ > 0] in this case are similar to those in a previous
report that assumed a sex-averaged human genetic
map of 33 Morgans; we used 40.7 Morgans for women
and 22.9 Morgans for men. For a genome of length
L Morgans and when A = 1, we have the approximation

\[
P[\theta' > 0] \approx 1 - \exp\!\left(-\frac{(2G - 1)L}{2^{2G-1}}\right).
\]

Supplementary information S2 (box) illustrates, using a
simple simulation, the relationship between the number of pedigree
ancestors and ancestors that actually transmitted DNA
to the current generation.
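The approximation for P[θ′ > 0] is easy to evaluate numerically. The sketch below assumes a sex-averaged map length of (40.7 + 22.9)/2 = 31.8 Morgans, which is our own simplification of the sex-specific maps used in the simulations, so the output should only be expected to track the simulated values in Table 1 roughly.

```python
import math

# Assumed sex-averaged map length in Morgans (mean of female and male maps).
SEX_AVERAGED_MORGANS = (40.7 + 22.9) / 2

def prob_any_ibd(G, L=SEX_AVERAGED_MORGANS):
    """Approximate P[theta' > 0] for one common ancestor G generations back."""
    return 1 - math.exp(-(2 * G - 1) * L / 2 ** (2 * G - 1))

for G in (3, 4, 5, 6, 8, 10):
    print(G, round(prob_any_ibd(G), 3))
```

For G = 3 to 10 the approximation stays within about 0.01 of the simulated probabilities for A = 1 in Table 1 (for example, close to 0.43 for G = 5).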

FIGURE 1 demonstrates the potential impact of using
θ (expected IBD) rather than θ′ (realized IBD) for h²
estimation (see below). Using a Type B simulation,
we generated phenotypes with pairwise correlations
between the genetic contributions to each phenotype
equal to θ′ over all 50 generations (G = 50) of the simulation,
and using this information when estimating h²
gives the best possible inferences (red). Inferences are
less precise if we only have available θ′ based on G = 10
(green) or G = 5 (blue), and worse again when instead
using θ based on G = 5, which corresponds to a complete
5-generation pedigree (purple). Precision deteriorates
as close relatives are progressively excluded from the
analysis, particularly when using θ. However, reasonable
estimation remains possible using θ′ even when
only distantly related individuals are considered.

Relatedness with unobserved pedigree 

When no pedigree information is available, allelic
correlations at genotyped markers have been used to
estimate the pedigree-based coancestry. Many models
of population genetics incorporate the following
expression for the probability that two homologous
alleles are both of type a (p_aa), where p_a is the probability
that one sampled allele is of type a.

\[
p_{aa} = \theta p_a + (1 - \theta) p_a^2 \tag{1}
\]

This equation is based on the idea that, with probability
θ, the two alleles have the same source and so
effectively reflect only one observed allele, which is a
with probability p_a. Otherwise, the two alleles are independent
and are both a with probability p_a². Defining
U_1 = 1 if the first allele is a, otherwise U_1 = 0, and similarly
U_2 for the second allele, we have E[U_1] = E[U_2] = p_a,
Var[U_1] = Var[U_2] = p_a(1 − p_a) and E[U_1 U_2] = p_aa.



Table 1 | Properties of genomic regions shared IBD by two individuals from G generations in the past

| Relationship | G | A | θ = E[θ′] | 95% CI of θ′ | P[θ′ > 0] | E[#SR] | ℓ_G (SD), Mb |
|---|---|---|---|---|---|---|---|
| Sibling | 1 | 2 | 0.25 = (1/2)^2 | (0.204, 0.296) | 1.000 | 85.9 | 31.1 (35.2) |
| Half-sibling | 1 | 1 | 0.125 = (1/2)^3 | (0.092, 0.158) | 1.000 | 42.9 | 31.1 (35.2)* |
| First cousin | 2 | 2 | 0.062 = (1/2)^4 | (0.038, 0.089) | 1.000 | 37.5 | 17.8 (21.5) |
| Half-cousin | 2 | 1 | 0.031 = (1/2)^5 | (0.012, 0.055) | 1.000 | 18.8 | 17.8 (21.5)* |
| Second cousin | 3 | 2 | 0.016 = (1/2)^6 | (0.004, 0.031) | 1.000 | 13.3 | 12.5 (15.4) |
| Half-second cousin | 3 | 1 | 0.008 = (1/2)^7 | (0.001, 0.020) | 0.995 | 6.7 | 12.5 (15.4)* |
| Third cousin | 4 | 2 | 0.004 = (1/2)^8 | (0.000, 0.012) | 0.970 | 4.3 | 9.6 (12.0) |
| Half-third cousin | 4 | 1 | 0.002 = (1/2)^9 | (0.000, 0.008) | 0.834 | 2.2 | 9.6 (12.0)* |
| | 5 | 1 | (1/2)^11 | (0.000, 0.004) | 0.431 | 0.7 | 7.9 (9.9) |
| | 6 | 1 | (1/2)^13 | (0.000, 0.001) | 0.160 | 0.2 | 6.6 (8.4) |
| | 8 | 1 | (1/2)^17 | (0.000, 0.000) | 0.015 | 0.0 | 5.1 (6.5) |
| | 10 | 1 | (1/2)^21 | (0.000, 0.000) | 0.001 | 0.0 | 4.1 (5.3) |

CI, credible interval; SR, shared region. We consider only IBD (identity-by-descent) sharing that results from the direct lineage path
of length G from each ancestor to each individual. A denotes the number of common ancestors: if A = 2, then these ancestors are
mates, and the two individuals descend from distinct offspring of this union. θ′ is the realized IBD genomic fraction from the
indicated common ancestors, for which we show the expected (E) value (which is equal to the coancestry (θ)), the equal-tailed 95%
CI and P[θ′ > 0], the probability that the two individuals share any genomic region IBD from those ancestors. Also shown are the
average number of SRs and, conditional on SR > 0, the expected region length in megabase pairs (ℓ_G) and its standard deviation
(SD). Estimates are based on 10^ Type A simulations (see Supplementary information S1 (box)). *The value shown is the same as the
one above by definition.
Figure 1 | Estimation of narrow-sense heritability (h²) using either expected IBD (θ) or realized IBD (θ′) for varying
levels of relatedness. The x axis shows the highest level of relatedness present: full-siblings, first cousins, second
cousins or third cousins ('unrelated'). We used a Type B simulation (see Supplementary information S1 (box)). From the current
generation, we drew 4 samples of 1,250 individuals, first with no filtering so that siblings were included, followed by
filtering to exclude close relatives (the x axis labels indicate the closest relationship included). For each sample, we
generated 100 phenotypes with narrow-sense heritability (h²) of 0.5, where the correlation structure of the genetic
contributions to each phenotype was specified by realized IBD (identity-by-descent; θ′) based on G = 50 generations. To
estimate h², it is necessary to specify a covariance matrix (K in equation 11). For each phenotype, we estimated h² using K
constructed from θ′ based on G = 50 (red boxes; the best-possible analysis in our model), θ′ based on G = 10 (green),
θ′ based on G = 5 (blue) or θ based on G = 5 (purple; corresponding to a 5-generation pedigree). Boxes indicate the
interquartile range for estimates, whiskers mark 1.5× this range, with values outside the 1.5× range individually plotted.



Therefore, equation 1 can be written in the form of a 
correlation coefficient as follows. 



Maximum likelihood
estimators 

Estimates of unknown 
parameters obtained by 
maximizing the likelihood for 
the observed data given a 
statistical model. 

Method of moments 
estimators 

Estimates of unknown 
parameters obtained by 
equating theoretical moments 
(for example, mean, variance 
and skewness) under the 
assumed statistical model to 
empirical moments calculated 
from the observed data. 



\[
\theta = \frac{p_{aa} - p_a^2}{p_a(1 - p_a)} = \frac{E[(U_1 - p_a)(U_2 - p_a)]}{\sqrt{\mathrm{Var}[U_1]\,\mathrm{Var}[U_2]}} = \mathrm{Cor}[U_1, U_2] \tag{2}
\]



If the two alleles are sampled at random in a subpopulation,
then θ equals the fixation index (F_ST). However,
if an allele is drawn from each of B and C (members of
a finite pedigree with unrelated founders), then θ in
equation 1 is their coancestry (θ(B,C) in BOX 1), and the
p_a are the founder allele probabilities in some reference
population. Ignoring the expectation in equation 2 and
using the observed values of U_1 and U_2, an unbiased estimator
of θ is obtained if the p_a are known. This estimator
is imprecise because it is based on only a single locus but,
as θ is constant across loci, precision can be improved by
averaging over loci (see equation 9 below).
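To make the averaging argument concrete, the following sketch simulates one pair of alleles per locus under the model of equation 1 and averages the single-locus moment estimates (U_1 − p_a)(U_2 − p_a)/(p_a(1 − p_a)). It is an illustration under simplifying assumptions of our own: biallelic loci, known allele fractions drawn uniformly between 0.1 and 0.9, and independence between loci (no linkage disequilibrium). The function names are hypothetical.

```python
import random

def simulate_pair(p, theta, rng):
    """Draw indicators (U1, U2) for one locus under the model of equation 1."""
    u1 = rng.random() < p
    if rng.random() < theta:   # same source: the second allele copies the first
        u2 = u1
    else:                      # otherwise an independent draw
        u2 = rng.random() < p
    return int(u1), int(u2)

def moment_estimate(pairs, freqs):
    """Average the single-locus estimates (U1 - p)(U2 - p) / (p(1 - p))."""
    terms = [(u1 - p) * (u2 - p) / (p * (1 - p))
             for (u1, u2), p in zip(pairs, freqs)]
    return sum(terms) / len(terms)

rng = random.Random(1)
theta = 0.125                                    # coancestry of half-siblings
freqs = [rng.uniform(0.1, 0.9) for _ in range(200_000)]
pairs = [simulate_pair(p, theta, rng) for p in freqs]
theta_hat = moment_estimate(pairs, freqs)
print(round(theta_hat, 3))
```

Each single-locus term is very noisy (it takes only a handful of distinct values), but with known p_a its expectation is θ, so the average over many loci concentrates around the simulated value.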

The value of θ in equation 1 can be interpreted
broadly, as representing any recent common origin of
the two alleles; similarly, the concept of the reference
population can be flexible. Interpreting θ in terms of
IBD originating within the past G generations provides
a coherent framework, but we have already outlined
above some of the practical difficulties. Moreover, it is
difficult to estimate the time depths of IBD genomic
regions because their lengths are highly variable, with
mean decreasing only linearly with time. Alternatively,
for two individuals sampled in a subpopulation, θ can be
interpreted in terms of an intra-subpopulation pedigree
of which founders are immigrants from a global population.
However, the assumptions that alleles drawn
from the global population are independent and that
the p_a are known or well estimated remain problematic,
particularly in the presence of population structure.

There is a substantial amount of conservation genetics
literature on estimators of θ based on equation 2,
and these estimators are mainly designed for tens
of multiallelic markers, such as short tandem repeat
(STR) loci. Maximum likelihood estimators have been
proposed, but method of moments estimators are often
preferred despite their lower precision, because they are
computationally efficient and can be unbiased if the p_a
are known. In practice, not only are the p_a typically
unknown, but the observed alleles from which they
might be estimated are also not drawn from the reference
population. Moreover, when the p_a are also estimated
from the same data, the estimate of θ is biased downwards
and is often negative, whereas θ ≥ 0. The limited number
of markers available until recently meant that θ was too
Coalescent tree

Each leaf of the tree 
corresponds to an observed 
allele, and the root represents 
the most recent common 
ancestor (MRCA) of all 
observed alleles. The internal 
nodes (branching points) 
represent the MRCA of the
alleles at the leaves connected 
to that node (without passing 
the root). Distances along 
branches represent time, 
measured in generations. 

IBS 

(Identical-by-state; also 
identity-by-state). When two 
homologous alleles have 
matching type. Some 
definitions of IBS exclude IBD 
(identity-by-descent). 



imprecise for the issues raised here to be of practical 
concern. However, high-density SNP data now permit 
precise estimation, forcing us to confront interpretation 
difficulties. Properties of estimators can be assessed in 
artificial, truncated pedigrees, but this may not be rel- 
evant in practical applications because real pedigrees 
are effectively infinite. 

With 771 SNPs, a 2010 zebra finch study found
that direct estimation of θ was poor and recommended
using SNPs to reconstruct the pedigree as a preliminary
step. By contrast, the results from a 2013 study on pigs
indicated that 2,000 SNPs could give relatedness estimates
that are superior to those from known pedigrees.
However, although recognizing the potential for better
measures of relatedness from markers rather than pedigrees,
these authors still used pedigree-based measures
as the 'gold standard' to assess marker panels. We suggest
that use of equation 2 to estimate θ in natural populations
should be avoided because of the problems of interpretation.
Equation 2 may still give a useful summary of
genome similarity, but it does not estimate a meaningful
parameter except in artificial settings. Models and
analyses can instead be formulated directly in terms of
genome similarity, which raises the problem of how to
compare measures of genome similarity (see below).

Coalescent theory 

A different framework for describing relatedness in 
populations without pedigree information is provided 
by coalescent theory, in which alleles at a locus are 
connected through a coalescent tree. In its simplest
form, the standard coalescent describes the probability 
distribution of the TMRCA of a set of homologous 
alleles, assuming random mating in a constant-size
population. In that case, the probability for two line- 
ages to 'coalesce' at an ancestral allele more than G 
generations in the past is as follows, where N is the 
number of diploid individuals. 



\[
P[\mathrm{TMRCA} > G] = \left(1 - \frac{1}{2N}\right)^{G} \tag{3}
\]



This model can be generalized to allow for variable 
population size and some forms of population struc- 
ture and selection. The standard coalescent is based on 
assuming a Poisson number of offspring per individual 
and that each mating generates one offspring so full 
siblings are rare. However, it can be used to approxi- 
mate the properties of some more-complex models, by 
replacing N in equation 3 with an effective population
size (N_e). For example, our Type B simulation with
N = 10,000 can be approximated by a coalescent model
with N_e = 8,450, which is in close agreement with a theoretical
formula based on the variance of the number
of offspring (see Supplementary information S3 (box)).

Under the coalescent model, the MRCA of two
haploid human genomes at a given site is unlikely to
be recent. In our Type B simulation model, the probability
of an MRCA in generation G is ~6 × 10^−5 for G
up to several hundred, which supports the assumption
that people are unrelated if nothing is known about
their relatedness. However, even for G = 78,000 (which
is the 99th percentile of the TMRCA distribution) and
assuming a mutation rate of 1.2 × 10^−8 per site per generation,
the probability of a mutation in either lineage
since the MRCA is still low (~0.002). Therefore, any
two human genomes will be IBS (identical-by-state) for
almost all genomic sites as a result of IBD. However, the
situation is different for STR loci, which are often used
in forensic identification and relatedness testing; STR
mutation rates are ~10^−3 per site per generation and so,
under the coalescent model, the majority of IBS will not
be a consequence of IBD.
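The figures quoted in this paragraph follow directly from equation 3 with N replaced by N_e = 8,450. The sketch below reproduces them; the helper names are ours, and the mutation calculation assumes that mutations arise independently along the two lineages, each of length equal to the TMRCA.

```python
import math

N_E = 8450     # effective population size quoted for the Type B simulation
MU = 1.2e-8    # mutation rate per site per generation

def prob_tmrca_exceeds(G, n=N_E):
    """Equation 3: P[TMRCA > G] = (1 - 1/(2N))**G."""
    return (1 - 1 / (2 * n)) ** G

def tmrca_quantile(q, n=N_E):
    """Real-valued G at which P[TMRCA <= G] equals q."""
    return math.log(1 - q) / math.log(1 - 1 / (2 * n))

per_generation = 1 / (2 * N_E)         # chance of coalescing in a given recent generation
g99 = tmrca_quantile(0.99)             # 99th percentile of the TMRCA
p_mut = 1 - (1 - MU) ** (2 * g99)      # mutation on either lineage since the MRCA

print(f"{per_generation:.1e}", round(g99), round(p_mut, 4))
```

The per-generation probability comes out near 6 × 10^−5, the 99th percentile near 78,000 generations and the mutation probability near 0.002, in line with the values in the text.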

Powell et al. acknowledged a conflict between pedigree-based
IBD theory and coalescent theory but, as they
recognized, there is ultimately no conflict: IBD can be 
described in terms of the more general coalescent theory. 
Broadly speaking, the IBD versus non-IBD distinction 
is a simplification of the coalescent theory in which the 
TMRCA is classified into recent and non-recent. This can 
sometimes be a useful simplification, but it does not pro- 
vide a satisfactory general notion of relatedness because 
none of the attempts to define 'recent' can solve the 
problem that a binary classification cannot capture 
the essentially continuous range of TMRCA values. A 
better approach is to define kinship coefficients in terms 
of genome-wide TMRCA distributions (BOX 2). 

There is a fundamental connection between coales- 
cent trees and pedigrees: a pedigree can be thought of 
as providing a 'scaffold' on which coalescent trees at 
different genomic loci are constructed. Pedigree mem- 
bers have maternal and paternal alleles at each locus, 
but each coalescent lineage passes through only one 
allele of each individual. Thus, we can consider the 
coalescent tree at a locus as a stochastic process on a 
fixed pedigree, making 'coin toss' decisions between the 
maternal and paternal chromosomes of each ancestor 
that is reached (see Supplementary information S4 
(box) ). Features of a more extensive pedigree, such as 
population structure, generate genome-wide influences 
on the coalescent distributions. This effect is evident 
in genome-wide association studies, in which popula- 
tion structure and cryptic relatedness (that is, pedigree 
effects) alter the genome-wide distribution of single- 
SNP association statistics. Coalescent modelling usually
ignores the pedigree because it is rarely observed
and is difficult to infer. However, pedigree effects are 
not always negligible, and it may be useful in some set- 
tings to jointly model the pedigree and the coalescent 
trees embedded in it. 

Recombination-sense IBD 

In the absence of an explicit pedigree, IBD was initially
defined in terms of mutation: pairs of alleles at
a locus are mutation-sense IBD if there has been no
mutation since their MRCA. IBD is now more commonly
defined as a property of a genomic region:
two haploid genomes are recombination-sense IBD
if there has been no recombination within the region
since their MRCA, ignoring mutations. There is no reference
population in this approach, but the problem
now is how to identify IBD regions, which are often
Box 2 | Relatedness in terms of times since common ancestors

The coancestry θ(B,C) is the expected value of a function equal to 1 when homologous
alleles from individuals B and C have a most recent common ancestor (MRCA) that is
'recent' (within a specified pedigree, within G generations or within a subpopulation),
and equal to 0 otherwise. A better approach is to use a genome-wide average of a more
informative function of the time since the MRCA (TMRCA). Although the TMRCA is
typically unknown, it can be estimated from dense markers or sequence data under a
demographic model. The estimate is imprecise at any one locus, but genome-wide
estimates can be highly informative and have been used to draw inferences about
historical demographic parameters.

If bilineal relationships are of interest, then it is necessary to consider two
genome-wide TMRCA distributions: the minimum TMRCA over the four pairs of alleles
and the TMRCA of the remaining allele pair. Distributions are often summarized by
their expectation, but for the TMRCA the probability assigned to recent times is of
most importance, which suggests summarizing the TMRCA distribution by the
expectation of a function that upweights recent times, such as exp(−TMRCA/c) for
some constant c.
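As a purely illustrative sketch of this kind of summary, the code below evaluates E[exp(−TMRCA/c)] for a background pair under the standard coalescent and for a hypothetical pair in which a fraction of allele pairs coalesce two generations back. The mixture weight (0.125), the choice c = 10, the truncation at 200,000 generations and all function names are assumptions made for the example; no estimator from marker data is implied.

```python
import math

def recent_ancestry_score(tmrca_probs, c):
    """E[exp(-TMRCA / c)] for a distribution given as {generation: probability}."""
    return sum(p * math.exp(-t / c) for t, p in tmrca_probs.items())

def standard_coalescent(n, max_gen):
    """Geometric TMRCA distribution for two lineages, truncated at max_gen."""
    q = 1 / (2 * n)
    return {t: (1 - q) ** (t - 1) * q for t in range(1, max_gen + 1)}

background = standard_coalescent(8450, 200_000)

# Hypothetical related pair: with probability 0.125 the MRCA is 2 generations
# back; otherwise the TMRCA follows the background distribution.
related = {t: 0.875 * p for t, p in background.items()}
related[2] += 0.125

c = 10
score_background = recent_ancestry_score(background, c)
score_related = recent_ancestry_score(related, c)
print(score_background < score_related)  # True
```

Because exp(−t/c) is close to 1 only for t of order c or smaller, the score is dominated by the probability mass at recent times, which is the behaviour the box asks of a summary function.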

Slatkin proposed defining fixation indices in terms of ratios of coalescence times,
and Rousset developed this idea to propose definitions of coancestries based
on the excess TMRCA probability in recent generations. The excess is relative to 
a random pair of individuals, assuming that the TMRCA probabilities for (B,C) and 
the random pair are proportional for large TMRCA. This definition reduces to 
equation 1 in simple settings but does not require any pedigree or founder population 
to be specified. No estimator from marker data was proposed, but this may provide a 
promising approach to develop new statistics that summarize relatedness without the 
requirement of pedigree information or a reference population. 



short, and recombination events may not be detectable. 
Recombination-sense IBD does not lead naturally to a
useful measure of genome-wide relatedness because all 
pairs of haploid genomes are entirely IBD; the question 
is where the breakpoints are between the IBD regions, 
and these can be hard to infer. 

The largest consumer genetic ancestry companies 
have databases with >10^ individuals, predominantly of 
European ancestry, and each genotyped at >10^ SNPs. 
A focus for such companies is to identify pairs of indi- 
viduals connected by short lineage paths. We discuss 
below the difficulties in using inferred IBD regions to 
achieve this, but here we question why discovering a 
poorly inferred, distant pedigree relationship based on 
sharing perhaps only one genomic region is preferred 
over seeking the highest level of genome-wide similar- 
ity. Part of the answer may be that measures of genome 
similarity have not been adequately developed. 

Another motive for inferring IBD regions is to understand
population structure and demographic history.
For example, detecting long regions shared IBD between 
individuals in different parts of the world can point to a 
recent migration event. The observed data may first be 
summarized in terms of IBD regions, and demographic 
inferences can then be based on these inferred regions. 
However, this two-step process will disregard any demo- 
graphic information that is not captured by the IBD 
inference, suggesting that more direct and statistically 
efficient demographic inferences may be possible. 

The statistics of IBD regions 

IBD regions can be measured in genetic distances 
(Morgans) or physical distances (base pairs). Genetic 
maps specify the relationship between them, which 
varies across the genome; they also differ across 
human populations, substantially at a fine scale. 

We report below region lengths in megabase pairs. 
The simulation-based estimates reported in TABLE 1 
for $E[\#SR]$ (the expected number of IBD regions for 
a pair of individuals) and for $\ell_G$ (the expected 
length of these regions) agree closely with theoretical 
(sex-averaged) values for 22 autosomes spanning 
2,667 Mb with map lengths of 40.7 Morgans for females 
and 22.9 Morgans for males (see Supplementary 
information S5 (box)): 

$$E[\#SR] = A \times \frac{22 + (40.7 + 22.9) \times G}{2^{2G-1}} \tag{4}$$

and 

$$\ell_G = \frac{2667}{22 + (40.7 + 22.9) \times G} \tag{5}$$

This implies that the mean length of IBD regions is 
just more than 2 Mb when G = 20 and just more than 
1 Mb when G = 40. Assuming again a mutation rate 
of $1.2 \times 10^{-8}$ per site per generation, the probability that 
an average-length genomic region shared (recombination-sense) 
IBD from an ancestor G generations back 
is also mutation-sense IBD is approximately constant 
over G at ~0.37 (see Supplementary information S6 
(box)). Therefore, average-length pairs of IBD regions 
are unlikely to be identical at the sequence level, even if 
they are from a recent common ancestor. 
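
As a quick numerical check of equations 4 and 5, the following sketch evaluates both formulas and approximates the probability that an average-length region carries no mutation. The function names are illustrative. The no-mutation probability assumes that mutations arrive as a Poisson process over the 2G meioses separating the two haplotypes; this is an assumption made here for illustration and is not necessarily the calculation used in Supplementary information S6.

```python
import math

FEMALE_MAP = 40.7   # autosomal map length in Morgans (females)
MALE_MAP = 22.9     # autosomal map length in Morgans (males)
N_AUTOSOMES = 22
GENOME_MB = 2667.0  # total autosomal length in Mb
MU = 1.2e-8         # mutation rate per site per generation


def expected_num_regions(G, A=1):
    """Equation 4: expected number of IBD regions for a pair of
    individuals, with A and G as used in the text."""
    return A * (N_AUTOSOMES + (FEMALE_MAP + MALE_MAP) * G) / 2 ** (2 * G - 1)


def mean_region_length_mb(G):
    """Equation 5: expected length (Mb) of an IBD region."""
    return GENOME_MB / (N_AUTOSOMES + (FEMALE_MAP + MALE_MAP) * G)


def prob_no_mutation(G):
    """Assumed Poisson model: probability of no mutation on an
    average-length region across the 2G meioses of the lineage path."""
    sites = mean_region_length_mb(G) * 1e6
    return math.exp(-2 * G * MU * sites)


for G in (10, 20, 40):
    print(G, round(mean_region_length_mb(G), 2), round(prob_no_mutation(G), 2))
```

Under this assumed model the probability stays close to 0.37 for moderate and large G because the mean region length shrinks roughly in proportion to 1/G while the number of meioses grows in proportion to G.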

FIGURE 2A shows the distribution of IBD region 
lengths for G = 1 and G = 10 based on the Type A simulations 
underlying TABLE 1. For G = 1 there is a peak 
in lengths close to 30 Mb because about one-third of 
sibling pairs have complete IBD for at least one of chromosomes 
21 and 22. A gamma distribution generally 
gives a good fit, except when G is very small owing 
to the difference between male and female recombination 
rates. We estimate the gamma shape parameter to 
be approximately constant over G at ~0.76, implying 
that the standard deviation (SD) is $\sim\ell_G/\sqrt{0.76}$. This is 
a higher SD than that for the exponential distribution 
(gamma distribution with shape parameter 1), which 
would apply to IBD region lengths if the recombination 
rate were uniform across the genome and sexes. 
Even when an IBD region arises from a shared parent, 
there is a substantial probability for its length 
to be short, whereas the shared region could still be 
large even for an ancestor >20 generations in the past. 
FIGURE 2B shows the inverse distribution based on the 
Type B simulations used for FIG. 1, and indicates how 
well the time depth of the common ancestor can be 
inferred from the region length. Very long shared 
regions (>80 Mb) are highly likely to descend from 
a recent ancestor (G < 5), but G has a wide range for 
regions <40 Mb. Up to 10 Mb in length, the majority 
of shared regions descend from an ancestor >20 
generations back. Our estimates are based on a very 
simple simulation model but are broadly consistent 
with an estimated age range of 32-52 generations for a 
10-centimorgan region shared by a pair of UK residents. 
There are some excellent blog discussions of issues 
around the statistics of genome sharing among relatives 
at gcbias, Genetic Inference and On Genetics. 
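
The relationship between the gamma shape parameter and the SD can be illustrated with a short simulation. This is a minimal sketch: the mean of 4.05 Mb is an illustrative value close to what equation 5 gives for G = 10, and the function names and random seed are arbitrary choices.

```python
import numpy as np

SHAPE = 0.76  # gamma shape parameter estimated in the text


def gamma_sd(mean_mb, shape=SHAPE):
    """SD of a gamma distribution with the given mean and shape."""
    return mean_mb / np.sqrt(shape)


def sample_lengths(mean_mb, n, shape=SHAPE, seed=0):
    """Draw n region lengths (Mb) from a gamma with the given mean."""
    rng = np.random.default_rng(seed)
    return rng.gamma(shape, scale=mean_mb / shape, size=n)


mean_mb = 4.05  # illustrative mean length, roughly equation 5 at G = 10
lengths = sample_lengths(mean_mb, 200_000)
print("gamma SD:", gamma_sd(mean_mb), "exponential SD:", mean_mb)
print("sample mean and SD:", lengths.mean(), lengths.std())
print("fraction shorter than 1 Mb:", (lengths < 1.0).mean())
```

A shape parameter below 1 puts more mass on both very short and very long regions than an exponential distribution with the same mean, which is why the SD exceeds the mean.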

SNP-based measures of relatedness 

We distinguish genome-wide averages of single-SNP 
statistics from haplotype-based methods (which are 
sometimes referred to, respectively, as IBS and IBD 
methods, or as methods that do not and do take account 
of linkage). Matrices of SNP-based kinship coefficients 
have been called genetic relatedness matrices in recent 
literature but, to avoid any implication that these 
matrices estimate pedigree-based relatedness, we prefer 
to call them genetic similarity matrices (GSMs). 

Single-SNP averages. We use $S_{Bj}$ to denote the genotype of 
B at the $j$th diallelic SNP, coded as 0, 1 or 2. Analogous with 
the definition of coancestry, a natural way to score the 
similarity of two individuals at each SNP is as the probability 
of a match between alleles drawn at random from 
each of them. In that case matching homozygotes (0,0) 
or (2,2) score 1; discordant homozygotes (0,2) score 0; 
while (0,1), (1,1) and (1,2) all score 0.5. Averaged over 
m SNPs, this gives an allele-sharing coefficient as 
follows, where $X_B$ is a (row) vector with $j$th entry $S_{Bj} - 1$. 



$$K_{as}(B,C) = \frac{1}{2} + \frac{1}{2m}\sum_{j=1}^{m}(S_{Bj}-1)(S_{Cj}-1) = \frac{1}{2} + \frac{1}{2m}X_B X_C^T \tag{6}$$



The corresponding GSM is then $(1 + XX^T/m)/2$, 
where X is a genotype matrix with row B equal to $X_B$. 
The range of $K_{as}$ depends on the MAF spectrum of the 
SNPs, and $K_{as}(B,B) = (1 + h_B)/2$, where $h_B$ denotes 
the homozygosity of B. Recall that $\theta(B,B) = (1 + f_B)/2$, 
where $f_B$ denotes the inbreeding coefficient of B. $K_{as}$ can 
be interpreted as average mutation-sense IBD under the 
assumption that IBS implies IBD, which is reasonable 
for SNPs given their low mutation rate. 

The case (1,1) of two heterozygotes can correspond 
to either IBD = 2 or IBD = 0 depending on phase, which 
is often unknown. Many authors prefer to score 
matching heterozygotes as 1 rather than 0.5, resulting 
in the following allele-sharing similarity, which has 
$K'_{as}(B,B) = 1$. 

$$K'_{as}(B,C) = 1 - \frac{1}{2m}\sum_{j=1}^{m}\left|S_{Bj} - S_{Cj}\right| \tag{7}$$
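
The two allele-sharing coefficients in equations 6 and 7 can be computed directly from a genotype matrix. The sketch below is a minimal illustration with hypothetical function names and a toy two-individual, four-SNP matrix. It assumes complete genotype data with no missing values.

```python
import numpy as np


def allele_sharing(S):
    """Equation 6: K_as for all pairs; S is (individuals x SNPs), coded 0/1/2."""
    X = S.astype(float) - 1.0
    m = S.shape[1]
    return 0.5 + X @ X.T / (2 * m)


def allele_sharing_het(S):
    """Equation 7: K'_as, which scores matching heterozygotes as 1."""
    S = S.astype(float)
    m = S.shape[1]
    abs_diff = np.abs(S[:, None, :] - S[None, :, :]).sum(axis=2)
    return 1.0 - abs_diff / (2 * m)


# Toy data: rows are individuals B and C, columns are SNPs.
S = np.array([[0, 1, 2, 2],
              [0, 1, 0, 1]])
K_as = allele_sharing(S)
K_as_het = allele_sharing_het(S)
print(K_as)
print(K_as_het)
```

In this toy example the per-SNP scores for the pair are 1, 0.5, 0 and 0.5 under equation 6, and 1, 1, 0 and 0.5 under equation 7, so the two coefficients differ only through the SNP at which both individuals are heterozygous.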



[Figure 2, part A: histograms of IBD region length (Mb) for a common ancestor 1 generation back (Aa) and 10 generations back (Ab).] 

Figure 2 | Statistics of IBD genomic regions. Distributions of lengths 
of genomic regions shared IBD (identity-by-descent) from a common 
ancestor of 1 generation (part Aa) and 10 generations (part Ab) in the 
past are shown. The solid black curves show an approximating gamma 
distribution, with mean given by equation 5 and a shape parameter of 
0.76. For all IBD regions arising from a common ancestor within the last 
50 generations, the bars show how the distribution of the generation 
of the common ancestor depends on the length of the region (part B). 
From bottom to top in the graph, the tranches correspond to G = 1 
(red), G = 2...9 (alternating dark and light blue), G = 10 (green), 
G = 11...20 (alternating dark and light blue) and G > 20 (grey). Plots in 
part A are based on Type A simulations and part B on a Type B 
simulation, details of which are provided in Supplementary 
information S1 (box). 
Linkage disequilibrium 

(LD). A population correlation 
of allele pairs drawn at 
different genomic loci in the 
same gamete (that is, in a 
haploid genome). 



$K_{as}$ and $K'_{as}$ give all SNPs equal weight, irrespective of 
MAFs, which has the advantage of avoiding estimation 
of population MAF values. However, the lower the MAF 
the greater the evidence for a recent common ancestor, 
which suggests giving more weight to rare shared alleles. 
One step in this direction is to centre the genotype 
scores. If $p_j$ is the population MAF at the $j$th SNP, and 
now defining $X_B$ as having entries $S_{Bj} - 2p_j$, then we can 
define the centred coefficient as follows. 

$$K_{c0}(B,C) = \frac{1}{m}\sum_{j=1}^{m}(S_{Bj} - 2p_j)(S_{Cj} - 2p_j) \tag{8}$$



The GSM is $XX^T/m$. Some authors use 
$2\sum_{j=1}^{m} p_j(1 - p_j)$ in place of m in the denominator of equation 8, 
so that $K_{c0}(B,B) = 1$ if B is outbred. Unlike $K_{as}$ 
and $K'_{as}$, the value of $K_{c0}$ can be negative, and this must 
frequently occur if the $p_j$ are estimated from the same 
data, as then $K_{c0}$ measures the excess or deficit of allele 
sharing from that expected under a random assignment 
of alleles. Some authors replace negative values with 
zero, which may be motivated by the belief that a non-negative 
$\theta$ is being estimated. However, the differences 
in genome similarity among unrelated individuals can 
be informative (FIG. 1), and setting negative values to 
zero discards much of this information. 

In addition to centring, it is common to standardize, 
which assumes each SNP to be equally informative, 
leading to the following definition. 

$$K_{c-1}(B,C) = \frac{1}{m}X_B X_C^T, \quad \text{where } X_{Bj} = \frac{S_{Bj} - 2p_j}{\sqrt{2p_j(1-p_j)}} \tag{9}$$



$K_{c-1}(B,C)$ is an average over SNPs of an estimator 
of an allelic correlation coefficient (see equation 2). It 
is the basis of principal component methods to adjust 
for population structure in association studies. A 
regression adjustment to $K_{c-1}$ has been proposed, 
imposing shrinkage towards zero in accordance with 
sample size, which was shown in simulations to offset 
errors when estimating $\theta$ or $h^2$. With high SNP density, 
genetic variation tagged by multiple SNPs in high linkage 
disequilibrium (LD) will be over-represented when 
calculating genomic similarities, which can have 
adverse implications for heritability analyses. Adjusting 
for this by reweighting SNPs in the calculation of $K_{c-1}$ 
gives improved estimates. 

$K_{c-1}$ is an unbiased and efficient estimator of $\theta$ in 
equation 2, provided that the modelling assumptions 
hold but, as discussed below equation 2, those assumptions 
rarely hold in practice (similar comments apply to 
$K_{c0}$). Although $K_{as}$, $K'_{as}$, $K_{c0}$ and $K_{c-1}$ all provide widely 
used SNP-based measures of genome similarity, there 
are no good grounds to regard any of them as a canonical 
SNP-based measure of relatedness. This frees us to 
search for measures that perform well in specific applications. 
In practice, genome-wide genome similarity is 
useful because it gives a guide to allele sharing at causal 
loci for the trait of interest, and so the best genome 
similarity coefficient will reflect the genomic architecture 
of the trait. This contrasts fundamentally with 
kinship coefficients, which are trait-independent, but 
the new flexibility can be highly profitable in terms of 
understanding genetic mechanisms. 

One place to start the search for better GSMs is the 
one-parameter family $K_{c\alpha}$, defined as in equation 9 
except that $X_{Bj}$ is now defined as follows. 

$$X_{Bj} = (S_{Bj} - 2p_j) \times [2p_j(1-p_j)]^{\alpha/2} \tag{10}$$

The special cases $\alpha = -1$ (equation 9) and $\alpha = 0$ (equation 8) 
are widely used in practice. We compare below 
the performance of GSMs based on $K_{c\alpha}$ for $\alpha = -2$, $-1$, 
0 and 1. These values are chosen for illustration, and a 
thorough study should consider additional values for $\alpha$. 
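
The one-parameter family can be computed in a few lines. The following sketch uses a hypothetical function name `gsm` and a toy genotype matrix, and estimates the allele frequencies from the same data, which is a simplification. Monomorphic SNPs would need to be removed before using negative values of α, because the scaling factor is then undefined.

```python
import numpy as np


def gsm(S, p, alpha):
    """K_c-alpha from equations 9 and 10.

    S is an (individuals x SNPs) genotype matrix coded 0/1/2 and
    p holds the allele frequency of the counted allele at each SNP.
    """
    S = np.asarray(S, dtype=float)
    p = np.asarray(p, dtype=float)
    X = (S - 2 * p) * (2 * p * (1 - p)) ** (alpha / 2)
    return X @ X.T / S.shape[1]


S = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [1, 2, 1]])
p = S.mean(axis=0) / 2  # frequencies estimated from the same data
for alpha in (-2, -1, 0, 1):
    print(alpha, np.round(gsm(S, p, alpha), 2))
```

Because the frequencies here come from the same sample, each row of the centred matrix sums to zero, so some off-diagonal entries are necessarily negative, in line with the remark on negative values of the centred coefficient.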

Methods based on detecting shared haplotypes. 

Averages of single-SNP coefficients do not take into 
account the lengths of genomic regions shared between 
two individuals. On average, the longer the shared 
region (or regions), the more recent the ancestor (or 
ancestors), which we have noted is relevant to some 
population genetics applications. With unphased 
genotypes, a simple approach is to seek genomic 
regions for which two individuals share at least one 
allele at every SNP. More sophisticated approaches 
usually require genotype data to be phased. Methods 
for phasing typically use a hidden Markov model 
with the aim of constructing haplotypes (the hidden 
states) that are consistent with the observed genotypes, 
allowing for mutation and recombination. Given the 
haploid data, it is then straightforward to identify IBS 
regions. However, a long region shared IBS by two individuals 
may have resulted from two or more lineage 
paths and is therefore not (recombination-sense) IBD. 
Conversely, an IBD region may consist of several IBS 
regions that are interrupted by occasional data errors 
or mutations since the MRCA. Inferring such regions 
requires a model for recombinations and mutations, 
which in turn implies a model for the demographic 
history of the population. Methods for identifying 
IBD regions differ in the size of data sets they can 
handle, which depends on the type of deterministic or 
stochastic search algorithm used. These algorithms can 
be sensitive to parameter choices, yet there is often no 
obvious way to tune these based on the data. For the 
largest data sets, simultaneous phasing of all individuals 
is computationally infeasible. A recently developed 
method avoids explicit phasing but 'penalizes' proposed 
IBD intervals according to the estimated number 
of implied phasing switches. 
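
The simple unphased approach mentioned above, in which one seeks regions where two individuals share at least one allele at every SNP, can be sketched as a scan for runs without opposite homozygotes. The function name, the `min_snps` threshold and the toy genotypes are hypothetical, and the sketch ignores genotyping error and physical or genetic distances, both of which a practical method would need to handle.

```python
import numpy as np


def compatible_runs(s_b, s_c, min_snps=1):
    """Return (start, end) index pairs, end exclusive, of maximal runs of SNPs
    at which two unphased genotypes (coded 0/1/2) share at least one allele."""
    s_b = np.asarray(s_b)
    s_c = np.asarray(s_c)
    # Only opposite homozygotes (0 versus 2) share no allele.
    shares_allele = np.abs(s_b - s_c) < 2
    runs, start = [], None
    for j, ok in enumerate(shares_allele):
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            if j - start >= min_snps:
                runs.append((start, j))
            start = None
    if start is not None and len(shares_allele) - start >= min_snps:
        runs.append((start, len(shares_allele)))
    return runs


s_b = [0, 1, 2, 0, 2, 2, 1]
s_c = [0, 2, 1, 2, 2, 1, 1]
print(compatible_runs(s_b, s_c))  # [(0, 3), (4, 7)]
```

Runs found this way are only candidates for IBD: as the text notes, a long compatible run can still arise from more than one lineage path.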

Current approaches typically neglect shorter IBD 
regions because they are harder to infer. This means 
that distantly related pairs of individuals will have little 
or no IBD detected, and so differences in genome-sharing 
among them will be poorly recorded. FIGURE 2 
shows that inferring the TMRCA based on region 
length is difficult but, for many applications, exploitation 
of the information in short IBD regions would be 
advantageous. 
Chromopainter offers a different approach that is 
based directly on the haplotype copying model. Every 
chromosome is regarded as a mosaic of fragments copied 
from other sampled chromosomes, possibly with 
some mutations, and the coancestry of two individuals 
is measured by the number of distinct copying events 
between them. Although copying is intended to reflect 
IBD, every part of every chromosome is copied from 
another chromosome, and so an individual that is not 
closely related with anyone else in the sample will have 
the closest genome matches recorded even if these 
are remote. 

Heritability and phenotype prediction 

We focus below on the use of GSMs to estimate $h^2$ and 
to predict phenotypes. Both applications traditionally 
used a matrix of $\theta$ values, which is now usually replaced 
by a GSM, to model phenotypic correlations among 
individuals, with the intuition being that the higher 
the genome similarity of two individuals, the more 
correlated their phenotypes that are under genetic 
control. 

The linear mixed model. Underlying both types of 
inference is the following regression model, in which a 
matrix K specifies the covariance structure of a vector of 
observed phenotypes Y, where N represents the normal 
(or Gaussian) distribution. 

$$Y \sim N(Z\beta_Z,\ \sigma_g^2 K + \sigma_e^2 I) \tag{11}$$

In this equation, $\beta_Z$ represents fixed effects corresponding 
to covariates in Z, I denotes the identity 
matrix, and $\sigma_g^2$ and $\sigma_e^2$ are the genetic and environmental 
variances, respectively. Given Y, Z and K, we typically 
estimate $\sigma_g^2$ and $\sigma_e^2$ using restricted maximum likelihood 
(REML), which seeks values that maximize the 
restricted model likelihood (see Supplementary information 
S7 (box)). For $h^2$ estimation, we are interested 
in the ratio of variance terms; when K is standardized 
to have a mean of zero and a mean diagonal value of 
one, $h^2 = \sigma_g^2/(\sigma_g^2 + \sigma_e^2)$. A key technique for phenotype 
prediction in plant and animal breeding is best linear 
unbiased prediction (BLUP), which predicts the 
phenotypes of new individuals from estimates of $K\sigma_g^2$. 
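
To make the role of K in equation 11 concrete, the sketch below estimates $h^2$ on simulated data. It is a simplified illustration rather than REML: it uses ordinary maximum likelihood over a grid of $h^2$ values, removes only a mean instead of fitting general covariates Z, and relies on an eigendecomposition of K. All function names and simulation settings are hypothetical.

```python
import numpy as np


def profile_loglik(h2, y, evals, evecs):
    """Log likelihood of y ~ N(0, s2 * (h2 * K + (1 - h2) * I)), with the
    total variance s2 replaced by its maximum likelihood estimate."""
    n = y.size
    d = h2 * evals + (1.0 - h2)   # eigenvalues of h2 * K + (1 - h2) * I
    y_rot = evecs.T @ y           # phenotypes rotated onto eigenvectors of K
    s2 = np.mean(y_rot ** 2 / d)
    return -0.5 * (n * np.log(2 * np.pi * s2) + np.sum(np.log(d)) + n)


def estimate_h2(y, K, grid=None):
    """Grid search for the h2 value that maximizes the profile likelihood."""
    if grid is None:
        grid = np.linspace(0.0, 0.99, 100)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()              # crude stand-in for fitting an intercept
    evals, evecs = np.linalg.eigh(K)
    logliks = [profile_loglik(h2, y, evals, evecs) for h2 in grid]
    return grid[int(np.argmax(logliks))]


# Toy example: K from random genotypes, phenotype simulated with h2 = 0.5.
rng = np.random.default_rng(1)
S = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
X = (S - S.mean(axis=0)) / S.std(axis=0)
K = X @ X.T / X.shape[1]          # mean diagonal value of one
beta = rng.normal(scale=np.sqrt(0.5 / X.shape[1]), size=X.shape[1])
y = X @ beta + rng.normal(scale=np.sqrt(0.5), size=200)
print("estimated h2:", estimate_h2(y, K))
```

With only 200 simulated unrelated individuals the estimate is imprecise, so the printed value should be read as an illustration of the mechanics, not as a demonstration of accuracy.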

SNP-based analyses that were pioneered in wild populations 
have been extensively applied in animal and 
plant breeding and, more recently, in humans. 
A feature of SNP-based $h^2$ estimation in humans is the use of unrelated 
individuals. This is counterintuitive because more 
relatedness generates more precise inferences. However, 
the problem is that inferences vary according to the 
levels of relatedness among the sampled individuals. In 
addition, most readily available data are from population 
samples that include little relatedness. By excluding 
any close relatives, sampled individuals only share 
the short genomic regions from remote ancestors that 
generate LD, which is reasonably stable across population 
samples. Furthermore, although high levels of 
relatedness would generate long-range tagging of causal 
variants, which can therefore all contribute to $h^2$, the 
ability to attribute $h^2$ to specific genomic regions would 
be greatly reduced. Despite the information loss from 
reduced relatedness, reasonable precision (SD < 0.05) 
can be achieved with, for example, 5,000 unrelated individuals, 
and the estimates are more consistent across 
studies. SNP-based $h^2$ values using unrelated individuals 
have been interpreted in terms of common causal 
variants because these are better tagged by SNPs than 
rare variants. However, poorly tagged rare variants will 
contribute to $h^2$ (REF. 73), which hinders interpretation. 
Even so, it is possible to make useful comparisons across 
genomic regions and across phenotypes. 

Prediction accuracy can be measured by the correlation 
(r) between observed and predicted phenotypes 
across test individuals. The squared correlation 
($r^2$) is bounded above by $h^2$ but in practice $r^2 \ll h^2$. 
Relatedness between training and test individuals 
enhances predictive accuracy in the test set, but this 
may give an over-optimistic assessment of performance 
if the model is applied to new, less-related individuals. 
In humans, prediction of complex traits is typically 
poor in the general population but can be useful in 
high-risk groups. 
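
A minimal sketch of BLUP-style prediction under the mixed model, and of the $r^2$ accuracy measure, is given below. It treats $h^2$ as known, scales the total phenotypic variance to one, and uses the training mean as the only fixed effect; these are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np


def blup_predict(K, y_train, train, test, h2):
    """Predict test phenotypes from training phenotypes under the mixed model,
    treating h2 as known and fitting only a mean as the fixed effect."""
    y_train = np.asarray(y_train, dtype=float)
    V = h2 * K[np.ix_(train, train)] + (1.0 - h2) * np.eye(len(train))
    mu = y_train.mean()
    weights = np.linalg.solve(V, y_train - mu)
    return mu + h2 * K[np.ix_(test, train)] @ weights


def r_squared(observed, predicted):
    """Squared correlation between observed and predicted phenotypes."""
    return np.corrcoef(observed, predicted)[0, 1] ** 2


# Three mutually unrelated training individuals; the test individual
# has genome similarity 1 with the first and 0 with the others.
K = np.eye(4)
K[0, 3] = K[3, 0] = 1.0
pred = blup_predict(K, [3.0, 0.0, 0.0], train=[0, 1, 2], test=[3], h2=0.5)
print(pred)
```

In the toy example the prediction moves halfway from the training mean towards the phenotype of the similar individual, which shows how both $h^2$ and the entries of K linking test and training individuals control how much information is borrowed.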



Table 2 | Model log likelihood, heritability estimates ($h^2$) and predictive accuracy ($r^2$) for different SNP-based GSMs 

                         Log likelihood                  Heritability ($h^2$)            Prediction accuracy ($r^2$) 
Trait                    α=-2    α=-1    α=0     α=1     α=-2    α=-1    α=0     α=1     α=-2    α=-1    α=0     α=1 
Bipolar disorder         -97     0*      -12     -32     1.00*   0.98    0.92    0.81    0.040   0.074*  0.073   0.069 
Coronary artery disease  -24     -3      0*      -1      0.33    0.41*   0.17    0.06    0.000   0.017   0.020*  0.02 
Crohn's disease          -178    -5      0*      -3      1.00    1.00    1.00    1.00    0.057   0.096   0.098*  0.095 
Hypertension             -32     -3      0*      -1      0.57*   0.48    0.21    0.08    0.005   0.024   0.026*  0.026 
Rheumatoid arthritis     -125    0*      -15     -72     0.77    1.00*   0.99    0.17    0.016   0.043   0.042   0.043* 
Type 1 diabetes          -65     0*      -7      -16     0.85*   0.82    0.41    0.16    0.031   0.060   0.060*  0.056 
Type 2 diabetes          -28     0*      0       -3      0.64*   0.52    0.22    0.08    0.009   0.026*  0.025   0.024 
Average                  -78     -2*     -5      -18     0.74    0.74*   0.56    0.34    0.022   0.048   0.049*  0.047 

GSM, genetic similarity matrix; SNP, single-nucleotide polymorphism. Data for seven disease traits are from the Wellcome Trust Case Control Consortium. 
The GSMs considered are $K_{c\alpha}$ for $\alpha \in \{-2, -1, 0, 1\}$. Log likelihoods, computed under the mixed model (equation 11), are reported relative to the maximum observed 
across GSMs. $h^2$ values correspond to the observed scale. *The GSMs marked by asterisks indicate those that maximize the model likelihood, $h^2$ and $r^2$. 
How to choose K? The success of both $h^2$ estimation 
and phenotype prediction using BLUP depends on the 
choice of K in equation 11. Pedigree-based $\theta$ values vary 
with the choice of pedigree, although in practice there is 
often little choice. The range of options for SNP-based 
K is much greater. In human genetics, $K_{c-1}$ is often chosen, 
which implies an assumption that all SNPs explain 
the same fraction of phenotypic variance, and so effect 
sizes tend to increase as MAF decreases. By contrast, 
$K_{c0}$ is often preferred in animal and plant breeding, and 
implies the same effect size distribution for all SNPs. 
These are both special cases of $K_{c\alpha}$ and some other value 
of $\alpha$ may provide better results, depending on the true 
relationship between MAFs and effect sizes. 

Figure 3 | Heritability estimates ($h^2$) and predictive accuracy ($r^2$) for 139 mouse 
phenotypes using different SNP-based GSMs. The genetic similarity matrices (GSMs) 
considered are $K_{c\alpha}$ for $\alpha \in \{-2, -1, 0, 1\}$, $K'_{as}$ and $K_{IBD}$ (which is a matrix recording inferred 
fractions of IBD (identity-by-descent)). The vertical dashed lines separate different 
categories of phenotype: behaviour, diabetes, asthma, immunology, haematology and 
biochemistry. Solid points indicate the GSM providing highest heritability estimates 
($h^2$; part a) or predictive accuracy ($r^2$; part b) for each phenotype. Average $h^2$ and $r^2$ 
values for each GSM are provided in parentheses. Data are from the Wellcome Trust 
Heterogeneous Stock Mouse Collection. SNP, single-nucleotide polymorphism. 

With many GSMs available, we can now choose 
K based on performance in applications. In simulations 
(see Supplementary information S8 (box)) we 
found near-perfect $h^2$ estimation and phenotype prediction 
when the true K (that was used for simulating 
the phenotypes) was used in the analysis. If instead we 
use $K_{c\alpha}$ for various $\alpha$, we find that maximizing $h^2$ does 
not reliably recover the true value of $\alpha$, but that maximizing 
the model likelihood and the predictive accuracy 
both give a good guide to $\alpha$, suggesting that these 
provide useful criteria for choosing K. These results 
also confirm that a GSM chosen to suit the architecture 
of the trait under study will perform better than a 
phenotype-independent choice. 

TABLE 2 shows results from $h^2$ estimation and phenotype 
prediction for seven human disease traits from the 
Wellcome Trust Case Control Consortium, for which 
we use $K_{c\alpha}$ in equation 11 with $\alpha = -2$, $-1$, 0 and 1. When 
dealing with binary outcomes, it is preferable to consider 
$h^2$ on the liability scale, but for the purpose of 
making comparisons the observed scale is adequate. 
On average, the model likelihoods are maximized when 
$\alpha = -1$, whereas $\alpha = 0$ is slightly superior for prediction. 
The latter result may reflect that, owing to the difficulty 
in estimating effect sizes for rare variants, prediction can 
perform well with $\alpha$ above its true value, as this increases 
the weight given to common variants. 

FIGURE 3 shows $h^2$ and $r^2$ for highly related mice 
using data from the Wellcome Trust Heterogeneous 
Stock Mouse Collection. We consider 139 phenotypes 
spanning behavioural, haematological, biochemical 
and disease-related phenotypes. In addition to $K_{c\alpha}$ 
for $\alpha = -2$, $-1$, 0 and 1, we also consider $K'_{as}$ and $K_{IBD}$ (a 
matrix recording pairwise IBD fractions inferred using 
FastIBD). The presence of high relatedness enables 
accurate estimation and prediction, and the different 
GSMs perform similarly overall. $K_{IBD}$ is the worst overall 
for prediction, which suggests that its high $h^2$ value 
may be inflated. However, $K_{IBD}$ gives the best prediction 
for two of the seven phenotype categories, reflecting 
that the best GSM depends on the genetic architecture 
of the trait. Specifically, $K_{IBD}$ performs well when the 
causal variants are not well tagged by the SNPs, which in 
turn suggests a lesser role for common causal variants. 

Random regression models. For any K of the form $XX^T$, 
which includes $K_{c\alpha}$ but neither $K_{as}$ nor $K'_{as}$, the mixed 
regression model (equation 11) can be reformulated as 
a random regression model in which the phenotype of 
B is expressed as follows, where $\beta_j \sim N(0, c\sigma_g^2)$ and $c$ is 
a constant. 
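
As a hedged sketch of how a random regression model of this kind relates to equation 11, one standard way to write it is shown below. The residual term $e_B$ and the row vector $Z_B$ of covariates for individual B are notation assumed here for illustration.

```latex
Y_B = Z_B \beta_Z + \sum_{j=1}^{m} X_{Bj}\,\beta_j + e_B,
\qquad \beta_j \sim N(0,\, c\,\sigma_g^2), \quad e_B \sim N(0,\, \sigma_e^2)

% Integrating out the independent SNP effects \beta_j gives the covariance
\mathrm{Var}(Y) = c\,\sigma_g^2\, X X^T + \sigma_e^2 I

% which has the form of equation 11 with K = c X X^T, for example c = 1/m.
```

Under these assumptions the GSM appears only as a by-product of placing a common Gaussian prior on the per-SNP effects, with the scaling in equation 10 determining how that prior variance depends on allele frequency.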
Simulations indicate that this model is effective for 
estimating $h^2$. However, for prediction, the assumption 
that all effect sizes follow a Gaussian distribution with 
constant variance is a severe limitation, so recently 
there have been many attempts to find more suitable 
models. 

When equation 12 is available, it provides a more 
interpretable model in which K has no role: $h^2$ is just 
variance explained in a regression model with SNP predictors. 
Historically, kinship has played a central part in $h^2$ 
analysis, so it seems unthinkable to estimate $h^2$ without 
this concept or a SNP-based proxy, but we believe that 
benefits accrue from the freedom to develop statistical 
structures that are tailored to particular traits, which differ 
fundamentally from traditional kinship coefficients. 



Conclusion 

Quantitative genetics has undergone exciting devel- 
opments in recent years, and this is affecting even the 
most fundamental aspects of the discipline, including 
our understanding of relatedness. The relatively simple 
IBD theory based on pedigrees remains useful in some 
areas, but particularly in natural populations we require 
more flexible and interpretable concepts to fully take 
advantage of the high numerical precision afforded 
by genome-wide SNP and sequence data. On the one 
hand, it seems clear that a satisfactory general defini- 
tion of relatedness must be based on concepts from 
coalescent theory, and particularly on genome-wide 
TMRCAs. On the other hand, the choice of numerical 
measures of relatedness can be driven by optimizing 
criteria that are relevant to applications, such as model 
likelihood and predictive accuracy. There is much 
progress to be made in both directions. 



Crafen, A. A geometric view of relatedness. Oxford 1 6. 

Surv. Evol. Biol. 2. 28-90 [1 985). 

Maynard Smith, J. Evolutionary Genetics [Oxford 1 7. 

Univ. Press, 1998). 

Rousset, F. Inbreeding and relatedness coefficients: 1 8. 

what do they measure? Heredity 88, 371-380 

[2002]. 19. 

This paper gives a critical examination of kinship 

coefficients and proposes a new approach to 

measure kinship based on a cumulative excess of 20. 

recent coalescences. 

Powell, J., Visscher, P. & Goddard, M. Reconciling the 
analysis of IBD and IBS in complex trait studies. 
Nature Rev. Genet. 11, 800-805 [2010]. 21. 
This is a review on IBS and IBD concepts, with a 
focus on choice of reference population; it also 
discusses SNP-based computation of relatedness 
coefficients and their use in heritability estimation. 22. 
Weir, B., Anderson, A. 6i Hepler, A. Genetic 
relatedness analysis: modern data and new 
challenges. A/ott/re/?ei/. Genet. 7, 771-780 [2006). 23. 
Kong, A. etal. Fine-scale recombination rate 
differences between sexes, populations and 24. 
individuals. Nature ^67, 1099-1103 (2010]. 
Corr, P. & Kippen, R. The case for parity and birth- 
order statistics. Aust. N. Z. J. Stat. 48, 1 71-200 
[2006]. 

Calboii, F, Sampson, J., Fretwell, N. fit Balding, D. 25. 
Population structure and inbreeding from pedigree 
analysis of purebred dogs. Genetics 179, 593-601 
[2008]. 26. 
Thompson, E. Identity by descent: variation in meiosis, 
across genomes, and in populations. Genetics 1 94, 
301-326 (2013]. 27. 
This is an extensive review on the IBD concept that 
covers many applications and citations to early 
literature. We disagree with the conceptual 28. 
framework, but there is much that is valuable in 
this review. 

Visscher, P. et al. Assumption-free estimation of 29. 

heritability from genome-wide identity-by-descent 

sharing between full siblings. PLoS Genet. 2. e41 30. 

[2006]. 

This paper introduces a clever innovation for 
heritability estimation and is the first to exploit 3 1 . 

differences in realized IBD among pairs of individuals 
with the same pedigree-based relatedness. 

Hill, W. G. On estimation of genetic variance within 

families using genome-wide identity-by-descent 32. 

sharing. Genet. Sel. Evol. 45, 32 [201 3). 

Yang, J. etal. Common SNPs explain a large 

proportion of the heritability for human height. Nature 

Genet. 42, 565-569 [2010). 

Falconer, D. & Mackay, T. Introduction to Quantitative 
Genetics 4th edn [Longman, 1 996]. 33. 
Donnelly, K. The probability that related individuals 
share some section of genome identical by descent. 
Theor Popul. Biol. 23, 34-63 [1983]. 
Kong, A. et al. Detection of sharing by descent, 34. 
long-range phasing and haplotype imputation. Nature 
Genet 40, 1068-1075 [2008]. 



Crow, J. & Kimura, M. An Introduction to Population 35. 

Genetics Theory (Harper and Row, 1 970]. 

Wright, S. The genetical structure of populations. 

Ann. Eugen. 15, 159-171 (1951]. 

Wright, S. Coefficients of inbreeding and relationship. 

Amer Nat. 61 , 330-338 (1 922). 

Manichaikul, A. etal. Robust relationship inference in 36. 
genome-wide association studies. Bioinformatics 26, 
2867-2873 (2010]. 

Csillery, K. etal. Performance of marker-based 

relatedness estimators in natural populations of 37. 

outbred vertebrates. Genetics 173, 2091-2101 

[2006]. 

Oliehoek, P., Windig, J., van Arendonk, J. & Bijma, P. 38. 

Estimating relatedness between individuals in general 

populations with a focus on their use in conservation 

programs. Genetics 173, 483-496 [2006]. 39. 

Beaumont, M. in Handbook of Statistical Genetics 

[eds Balding, D., Bishop, M. & Cannings, C] Ch. 30 

[Wiley, 2007). 40. 

Thompson, E. The estimation of pairwise relationships. 

Ann. Hum. Genet. 39, 173-188 (1975). 

Santure, A. etal. On the use of large marker panels to 41 . 

estimate inbreeding and relatedness: empirical and 

simulation studies of a pedigreed zebra finch 42. 

population typed at 771 SNPs. Mol. Ecol. 19, 

1439-1451 [2010). 

Lopes, M. et al. Improved estimation of inbreeding 43. 
and kinship in pigs using optimized SNP panels. BMC 
Genet. 14, 92 [2013]. 

Nordborg, M. in Handbook of Statistical Genetics [eds 
Balding, D., Bishop, M. 6t Cannings, C] Ch. 25 (Wiley, 44. 
2007). 

Kong, A. et al. Rate of de novo mutations and the 
importance of father's age to disease risk. Nature 45. 
488, 471-475 (2012]. 

Astle, W. & Balding, D. Population structure and 

cryptic relatedness in genetic association studies. 46. 

Statist. Sci. 24, 451-471 [2009]. 

Malecot, G. The Mathematics of Heredity (Freeman, 

1969]. 

Sved, J. Linkage disequilibrium and homozygosity of 47. 
chromosome segments in finite populations. 
Theoretical Popul. Biol. 2, 125-141 [1971]. 
Hayes, B., Visscher, P., McPartlan, H. & Goddard, M. 
Novel multilocus measure of linkage disequilibrium to 
estimate past effective population size. Genome Res. 48. 
13,635-643 [2003). 

Lawson, D. & Falush, D. Population identification using 

genetic data. Annu. Rev Genet. 1 3, 337-361 [201 2). 

This is a review on available GSMs, both that do 49. 

and do not take account of linkage, from the 

perspective of classifying individuals into 

populations. 

Graffelman, J., Balding, D., Gonzalez- Neira, A. & 50. 
Bertranpetit, J. Variation in estimated recombination 
rates across human populations. Hum. Genet. 122, 
301-310(2007]. 51. 
Wegmann, D. et al. Recombination rates in admixed 
individuals identified by ancestry-based inference. 
Nature Genet. 43, 847-853 (2011). 



Ralph, P. 5i Coop, G. The geography of recent genetic 
ancestry across europe. PLoS Biol. 1 1 , elOOl 555 
(2013). 

This paper investigates IBD genome sharing across 
Europe and how this reflects population size and 
migrations over recent millennia. 

Forni, S., Aguilar, I. 6i Misztal, 1. Different genomic 
relationship matrices for single-step analysis using 
phenotypic, pedigree and genomic information. Genet. 
Sel. Evol. 43, 1-7 [2011]. 

Lee, J. Y. S., Goddard, M. & Visscher, P GCTA: atool 
for genome-wide complex trait analysis. Am. J. Hum. 
Genet. 88, 76-82 [2011]. 

Toro, M. etal. Estimation of coancestry in Iberian pigs 
using molecular markers. Conserv. Genet. 3, 309-320 
[2002]. 

Purcell, S. et al. PLINK: a toolset for whole-genome 
association and population-based linkage analysis. 
Am. J. Hum. Genet. 81 , 559-575 [2007]. 
Habier, D., Fernando, R. 5i Dekkers, J. The impact of 
genetic relationship information on genome-assisted 
breeding values. Genetics 177, 2389-2397 [2007). 
VanRaden, P. Efficient methods to compute genomic 
predictions. J. Dairy Sci. 91 , 4414-4423 [2008]. 
Yu, J. et al. A unified mixed-model method for 
association mapping that accounts for multiple levels 
of relatedness. Nature Genet. 38, 203-208 (2006). 
Loiselle, B., Sork, V., Nason, J. & Graham, C. Spatial 
genetic structure of a tropical understory shrub 
Psychotria officinalis [Rubiaceae]. Am. J. Bot. 82, 
1420-1425 [1995]. 

Amin, N., van Duijn, C. & Aulchenko, Y. Agenomic 
background based method for association analysis in 
related individuals. PLoS 0NE2, el274 (2007]. 
Price, A. etal. Principal components analysis corrects 
for stratification in genome-wide association studies. 
Nature Genet. 38, 904-909 [2006]. 
Speed, D., Hemani, G., Johnson, M. & Balding, D. 
Improved heritability estimation from genome-wide 
SNP data. /^m. J. Hum. Genet. 91, 1011-1021 

[2012) . 

Scheet, P. & Stephens, M. A fast and flexible statistical
model for large-scale population genotype data:
applications to inferring missing genotypes and
haplotypic phase. Am. J. Hum. Genet. 78, 629-644
(2006).

Li, Y., Willer, C., Ding, J., Scheet, P. & Abecasis, G.
MaCH: using sequence and genotype data to estimate
haplotypes and unobserved genotypes. Genet.
Epidemiol. 34, 816-834 (2010).

Browning, B. & Browning, S. A unified approach to
genotype imputation and haplotype-phase inference
for large data sets of trios and unrelated individuals.
Am. J. Hum. Genet. 84, 210-223 (2009).

Howie, B., Marchini, J. & Stephens, M.
Genotype imputation with thousands of genomes. G3
1, 457-470 (2011).

Delaneau, O., Zagury, J. & Marchini, J. Improved
whole-chromosome phasing for disease and
population genetic studies. Nature Methods 10, 5-6
(2013).






52. Li, N. & Stephens, M. Modeling linkage disequilibrium
and identifying recombination hotspots using
single-nucleotide polymorphism data. Genetics 165,
2213-2233 (2003).

53. Thompson, E. The IBD process along four
chromosomes. Theor. Popul. Biol. 73, 369-373
(2008).

54. Gusev, A. et al. Whole population, genome-wide
mapping of hidden relatedness. Genome Res. 19,
318-326 (2009).

55. Bercovici, S., Meek, C., Wexler, Y. & Geiger, D.
Estimating genome-wide IBD sharing from SNP data
via an efficient hidden Markov model of LD with
application to gene mapping. Bioinformatics 26,
i175-i182 (2010).

56. Moltke, I., Albrechtsen, A., Hansen, T., Nielsen, F. &
Nielsen, R. A method for detecting IBD regions 
simultaneously in multiple individuals — with 
applications to disease genetics. Genome Res. 21,
1168-1180 (2011). 

57. Browning, B. & Browning, S. A fast, powerful method
for detecting identity by descent. Am. J. Hum. Genet.
88, 173-182 (2011).

58. Browning, B. & Browning, S. Improving the accuracy
and efficiency of identity-by-descent detection in
population data. Genetics 194, 459-471 (2013).

59. Li, H. et al. Relationship estimation from whole-
genome sequence data. PLoS Genet. 10, e1004144
(2014).

60. Durand, E., Eriksson, N. & McLean, C. Reducing
pervasive false-positive identical-by-descent segments
detected by large-scale pedigree analysis. Mol. Biol.
Evol. 31, 2212-2222 (2014).

61. Hellenthal, G., Auton, A. & Falush, D. Inferring human
colonization history using a copying model. PLoS
Genet. 4, e1000078 (2008).

62. Corbeil, R. & Searle, S. Restricted maximum
likelihood (REML) estimation of variance components
in the mixed model. Technometrics 18, 31-38
(1976).

63. Henderson, C. Estimation of genetic parameters.
Ann. Math. Stat. 21, 309-310 (1950).

64. Henderson, C., Kempthorne, O., Searle, S. & von
Krosigk, C. The estimation of environmental and
genetic trends from records subject to culling.
Biometrics 15, 192-218 (1959).

65. Mousseau, T., Ritland, K. & Heath, D. A novel method
for estimating heritability using molecular markers.
Heredity 80, 218-224 (1998).

66. Hayes, B., Bowman, P., Chamberlain, A. &
Goddard, M. Genomic selection in dairy cattle:
progress and challenges. J. Dairy Sci. 92, 433-443
(2009).



67. Goddard, M. & Hayes, B. Genomic selection. J. Anim.
Breed. Genet. 124, 323-330 (2007).

68. Goddard, M. Genomic selection: prediction of accuracy
and maximisation of long term response. Genetica
139, 245-257 (2009).

69. Scutari, M., Mackay, I. & Balding, D. Improving the
efficiency of genomic selection. Stat. Appl. Genet. Mol.
Biol. 12, 517-527 (2013).

70. Makowsky, R. et al. Beyond missing heritability:
prediction of complex traits. PLoS Genet. 7,
e1002051 (2011).

71. de los Campos, G., Hickey, J., Pong-Wong, R. &
Daetwyler, H. Whole genome regression and
prediction methods applied to plant and animal
breeding. Genetics 193, 327-345 (2013).

72. Visscher, P. et al. Statistical power to detect genetic
(co)variance of complex traits using SNP data in
unrelated samples. PLoS Genet. 10, e1004269 (2014).

73. Dickson, S., Wang, K., Krantz, I., Hakonarson, H. &
Goldstein, D. Rare variants create synthetic genome-
wide associations. PLoS Biol. 8, e1000294 (2010).

74. Yang, J. et al. Genomic partitioning of genetic variation
for complex traits using common SNPs. Nature Genet.
43, 519-525 (2011).

75. de los Campos, G., Gianola, D. & Allison, D.
Predicting genetic predisposition in humans: the
promise of whole-genome markers. Nature Rev. Genet.
11, 880-886 (2010).

76. Dudbridge, F. Power and predictive accuracy of
polygenic risk scores. PLoS Genet. 9, e1003348 (2013).

77. Wray, N. et al. Pitfalls of predicting complex traits
from SNPs. Nature Rev. Genet. 14, 507-515 (2013).

78. Speed, D. et al. Describing the genetic architecture of
epilepsy through heritability analysis. Brain 137,
2680-2689 (2014).

79. The Wellcome Trust Case Control Consortium.
Genome-wide association study of 14,000 cases of
seven common diseases and 3,000 shared controls.
Nature 447, 661-678 (2007).

80. Dempster, E. & Lerner, I. Heritability of threshold
characters. Genetics 35, 212-236 (1950).

81. Valdar, W. et al. Genome-wide genetic association of
complex traits in heterogeneous stock mice. Nature
Genet. 38, 879-887 (2006).

82. Zhou, X., Carbonetto, P. & Stephens, M.
Polygenic modeling with Bayesian sparse linear
mixed models. PLoS Genet. 9, e1003264 (2013).

83. Speed, D. & Balding, D. MultiBLUP: improved SNP-
based prediction for complex traits. Genome Res. 24,
1550-1557 (2014).

84. Lippert, C. et al. The benefits of selecting phenotype-
specific variants for applications of mixed models in
genomics. Sci. Rep. 3, 1815 (2013).



85. Cockerham, C. Higher order probability functions of 
identity of alleles by descent. Genetics 69, 235-246 
(1971). 

86. Cannings, C. & Thomas, A. in Handbook of Statistical
Genetics (eds Balding, D., Bishop, M. & Cannings, C.) 
Ch. 23 (Wiley, 2007). 

87. Jacquard, A. The Genetic Structure of Populations 
(Springer, 1974).

88. Cotterman, C. A. in Genetics and Social Structure (ed.
Ballonoff, P. A.) 157-272 (Dowden, Hutchinson &
Ross, 1974). 

89. Guo, S. Variation in genetic identity among relatives. 
Hum. Hered. 46, 61-70 (1996). 

90. Hill, W. G. & Weir, B. S. Variation in actual relationship
among descendants of inbred individuals. Genet. Res.
94, 267-274 (2012). 

91. Fisher, R. The Theory of Inbreeding (Oliver and Boyd,
1949). 

92. Li, H. & Durbin, R. Inference of human population 
history from individual whole-genome sequences. 
Nature 475, 493-496 (2011).

93. Slatkin, M. Inbreeding coefficients and coalescence 
times. Genet. Res. 58, 167-175 (1991).

Acknowledgements

The authors thank G. Hellenthal and D. Kennett (both 
University College London), and M. Beaumont (University of 
Bristol) for discussion. This work is funded by the UK Medical 
Research Council under grant G0901388, with support from
the National Institute for Health Research University College 
London Hospitals Biomedical Research Centre. Access to 
Wellcome Trust Case Control Consortium data was authorized 
as work related to the project "Genome-wide association 
study of susceptibility and clinical phenotypes in epilepsy". 

Competing interests statement 

The authors declare no competing interests. 



FURTHER INFORMATION 

gcbias: http://gcbias.org/

Genetic inference: http://www.genetic-inference.co.uk/blog/2009/11/how-many-ancestors-share-our-dna/

On Genetics: http://ongenetics.blogspot.co.uk/2011/02/genetic-genealogy-and-single-segment.html

Wellcome Trust Heterogeneous Stock Mouse Collection: mus.well.ox.ac.uk/mouse

SUPPLEMENTARY INFORMATION 

See online article: S1 (box) | S2 (box) | S3 (box) | S4 (box) |
S5 (box) | S6 (box) | S7 (box) | S8 (box)







